The dataset (https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) we have chosen is found on Kaggle and contains information about age, gender, bmi (and 7 other parameters) along with whether or not the patient has had a stroke. Each row in the dataset provides relevant information about each of the 5110 patients. For each patient, there are 12 attributes:
Unique id assigned to each patient;
Gender, which can be male, female, or other;
Age expressed in years;
Hypertension, which takes the value 1 (has had hypertension) or 0 (never had);
Heart disease, which takes the value 1 (has had heart disease) or 0 (never had);
Whether the patient has ever been married (Yes) or not (No);
Work type which has several options: self-employed, private, government job, children, never worked;
Type of residence: rural or urban;
Average blood glucose level;
The BMI (body mass index), measured in kg/m^2;
Smoking status, one of ‘formerly smoked’, ‘never smoked’, ‘smokes’, or ‘Unknown’;
Whether the patient has previously had a stroke (1) or not (0);
The location where these data were collected, as well as the period of collection, remain undisclosed.
The data were collected because stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths, according to the World Health Organization (WHO).
The dataset is released under the licence “Data files © Original Authors”, meaning the rights remain with the original authors rather than the uploader. The original source, as stated on Kaggle, is confidential, and use of this dataset is restricted to educational purposes only.
# Reading in the data
stroke <- read.csv("healthcare-dataset-stroke-data.csv")
# Transforming bmi from character to numeric
stroke$bmi <- as.numeric(stroke$bmi)
## Warning: NAs introduced by coercion
# DATA CLEANING
sum(is.na(stroke$bmi))
## [1] 201
# Since only 201 observations of bmi are NA
# (201/5110 * 100 = 3.93% of all rows),
# we can safely omit these rows when cleaning the data.
stroke <- na.omit(stroke)
# We now have 4909 valid observations in the stroke dataset
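As a quick sanity check on the cleaning step (using the `stroke` data frame created above):

```r
# Confirm the cleaned dataset: 5110 - 201 = 4909 rows remain, with no NAs left
nrow(stroke)       # 4909
sum(is.na(stroke)) # 0
```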
Outline your main research question.
Research Question: Which variables best allow us to predict stroke for an individual? Is it possible to predict stroke using these variables?
Do some exploratory data analysis to tell an interesting story about your data. Instead of limiting yourself to relationships between just two variables, broaden the scope of your analysis and employ creative approaches that evaluate relationships between more than two variables.
We have done EDA in order to respond to the following subsidiary EDA questions about the dataset:
a. The dataset contains mostly binary variables. Is the dataset balanced with respect to these binary variables?
# GENDER
barplot(table(stroke$gender),
ylim = c(0, 3000),
col = c("pink", "blue", "gray"),
main ="Barplot of Gender")
# Roughly 2000 males and 3000 females, and very few labeled as "other"
# HYPERTENSION
barplot(table(stroke$hypertension),
ylim = c(0, 5000),
col = c("green","red"),
names.arg = c("no hypertension", "hypertension"),
main ="Barplot of Hypertension")
# Many more people without hypertension than with it, as would be expected in a general population.
# HEART_DISEASE
barplot(table(stroke$heart_disease),
ylim = c(0, 5000),
col = c("green","red"),
names.arg = c("no heart disease", "heart disease"),
main ="Barplot of Heart Disease")
# Again, a lot more people without heart disease than with heart disease, which is a reasonable result.
# EVER_MARRIED
barplot(table(stroke$ever_married),
ylim = c(0, 4000),
col = c("red", "green"),
main = "Barplot of Ever Married")
# We have more married people than not married, but the importance of this variable is unclear.
# RESIDENCE_TYPE
barplot(table(stroke$Residence_type),
ylim = c(0, 3000),
col = c("green", "gray"),
main = "Barplot of Residence Type")
# The residence type is pretty even, half urban and half rural.
# SMOKING STATUS
barplot(table(stroke$smoking_status),
ylim = c(0, 2000),
cex.names = 0.71,
col = c("orange", "green", "red", "gray"),
main = "Barplot of Smoking Status")
# Most people in this dataset either never smoked or their status is unknown.
# Less than 2000 people smoke or formerly smoked.
# Finally, our main variable of interest:
# STROKE:
barplot(table(stroke$stroke),
ylim = c(0, 5000),
col = c("green", "red"),
names.arg = c("no stroke", "stroke"),
main = "Barplot of Stroke")
# This dataset is strongly imbalanced with respect to stroke:
# far more people have never had a stroke than have had one, which is consistent with reality.
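To put a number on this imbalance, we can compute the class proportions directly (a quick sketch using the cleaned `stroke` data frame):

```r
# Fraction of patients in each stroke class; the positive class is a small minority
prop.table(table(stroke$stroke))
```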
b. Is there any natural segmentation between people who have and have not had a stroke based on lifestyle-related measurements (e.g. bmi, glucose level)?
We can do K-means clustering (in 3D) with the continuous variables:
library(ggpubr)
## Loading required package: ggplot2
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
select <- na.omit(stroke[, c("bmi", "age", "avg_glucose_level")])
# SCALING VARIABLES
select$bmi <- scale(select$bmi)
select$age <- scale(select$age)
select$avg_glucose_level <- scale(select$avg_glucose_level)
select_matrix <- as.matrix(select)
fviz_nbclust(select_matrix, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1) # kmeans picks random initial centers; fix the seed for reproducibility
select$cluster <- factor(kmeans(select, centers = 4)$cluster)
p <- plot_ly(select, x=~bmi, y=~age,
z=~avg_glucose_level, color=~cluster) %>%
add_markers(size=1.5)
htmltools::tagList(list(p))
There don’t seem to be any prominent clusters in the data, so there is no obvious natural segmentation among the participants of this study.
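To double-check this visual impression, we can cross-tabulate the clusters against stroke status (a sketch; the rows of `select` and `stroke` align because both come from the same na.omit()-cleaned data). If the clusters captured stroke-relevant segmentation, the stroke rate would differ sharply across them:

```r
# Proportion of stroke cases within each k-means cluster
tapply(stroke$stroke, select$cluster, mean)
```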
c. Is there any significant linear relationship between the continuous variables in this dataset (age, bmi, avg_glucose_level)?
We can try to compute a scatterplot matrix in order to observe the relationships between the continuous variables in this dataset.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
stroke_continuous <- stroke[, c("age","bmi","avg_glucose_level")]
stroke_continuous <- na.omit(stroke_continuous)
ggpairs(stroke_continuous,ggplot2::aes(colour="red"))
There are no obvious strong linear relationships among the continuous variables bmi, average glucose level, and age. Thus, when we fit the model with stroke as the dependent variable, we will not have to worry much about multicollinearity among them. To be sure, let us compute the correlation matrix with the other numerical predictors included:
stroke_1 <- stroke[, -c(1,2,6,7,8,11)]
cor(stroke_1)
## age hypertension heart_disease avg_glucose_level
## age 1.0000000 0.2744249 0.25712278 0.2358382
## hypertension 0.2744249 1.0000000 0.11599099 0.1805427
## heart_disease 0.2571228 0.1159910 1.00000000 0.1545251
## avg_glucose_level 0.2358382 0.1805427 0.15452512 1.0000000
## bmi 0.3333980 0.1678106 0.04135744 0.1755022
## stroke 0.2323309 0.1425146 0.13793779 0.1389359
## bmi stroke
## age 0.33339800 0.23233086
## hypertension 0.16781058 0.14251461
## heart_disease 0.04135744 0.13793779
## avg_glucose_level 0.17550218 0.13893586
## bmi 1.00000000 0.04237366
## stroke 0.04237366 1.00000000
The most strongly correlated pairs are bmi and age, followed by age and hypertension. We can visualize these relationships as follows:
library(GGally)
ggcorr(stroke[,-1])
## Warning in ggcorr(stroke[, -1]): data in column(s) 'gender', 'ever_married',
## 'work_type', 'Residence_type', 'smoking_status' are not numeric and were ignored
We see that stroke's highest correlation with any other variable is with age, though even that correlation is only mildly positive.
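Beyond eyeballing pairwise correlations, variance inflation factors give a more direct multicollinearity check (a sketch; it assumes the `car` package is installed, which is not loaded elsewhere in this report):

```r
library(car)
# VIF near 1 means a predictor is nearly uncorrelated with the others;
# values above 5 are a common rule of thumb for problematic collinearity
vif(lm(stroke ~ age + hypertension + heart_disease +
         avg_glucose_level + bmi, data = stroke))
```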
d. Are older people, people with higher bmi, or people with higher average glucose levels more likely to suffer a stroke?
boxplot(age ~ stroke,
data = stroke,
main = "Boxplot of Age vs Stroke",
col = c("green", "yellow"))
From this side-by-side boxplot, it seems like older people suffer from stroke more often than younger people, as we would expect in real life.
boxplot(bmi ~ stroke,
data = stroke,
main = "Boxplot of bmi vs Stroke",
col = c("lightblue", "blue"))
There does not seem to be a meaningful difference in bmi between people who have and have not had a stroke, although the bmi distribution of the no-stroke group contains far more outliers on the higher end.
boxplot(avg_glucose_level ~ stroke,
data = stroke,
main = "Boxplot of Average Glucose Level vs Stroke",
col = c("pink", "red"))
It seems that, in general, the people that have higher average blood glucose levels tend to suffer from stroke more often than people who have lower glucose levels.
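These visual impressions can be backed by formal two-sample tests; since the distributions are skewed, a Wilcoxon rank-sum test is arguably more appropriate than a t-test (a sketch using the cleaned `stroke` data frame):

```r
# Do age, glucose level, and bmi differ between the stroke and no-stroke groups?
wilcox.test(age ~ stroke, data = stroke)
wilcox.test(avg_glucose_level ~ stroke, data = stroke)
wilcox.test(bmi ~ stroke, data = stroke)
```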
Use one of your research questions, or come up with a new one depending on feedback from the proposal. If applicable, perform a hypothesis test to answer your question. If you do, make sure to check any applicable assumptions. If you do not use a hypothesis test, use other means to show what answer the data provide.
Since our main research question is “What are the variables that best allow us to predict stroke for an individual?”, we will perform model selection using three methods:
Lasso regression with the most appropriate value for lambda
Backward elimination based on p-values
Forward selection based on p-values
Lasso Regression for Model Selection
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1
# Generating model matrix
X <- model.matrix(stroke ~ gender + age + hypertension + heart_disease + ever_married + work_type + Residence_type + avg_glucose_level + bmi + smoking_status, data = stroke)[,-1]
# `[,-1]` removes the first column: The first column corresponds to an intercept.
# `glmnet` also adds an intercept, therefore we remove it here.
# Finding the best value for lambda
set.seed(1)
cv_result <- cv.glmnet(x = X, y = stroke$stroke, family = "binomial", keep = TRUE)
plot(cv_result)
cat("The value of lambda that yielded the lowest cross-validated error:", cv_result$lambda.min)
## The value of lambda that yielded the lowest cross-validated error: 0.001500674
# Refit at the selected lambda rather than hard-coding its value
lasso_fit <- glmnet(x = X, y = stroke$stroke, lambda = cv_result$lambda.min, family = "binomial")
coef(lasso_fit)
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -7.522473918
## genderMale .
## genderOther .
## age 0.066255878
## hypertension 0.504055849
## heart_disease 0.351618296
## ever_marriedYes .
## work_typeGovt_job .
## work_typeNever_worked .
## work_typePrivate 0.084822677
## work_typeSelf-employed -0.199362826
## Residence_typeUrban .
## avg_glucose_level 0.004446061
## bmi .
## smoking_statusnever smoked .
## smoking_statussmokes 0.245736020
## smoking_statusUnknown -0.110532397
Lasso regression selected the following predictors:
age
hypertension
heart_disease
work_type
avg_glucose_level
smoking_status
Even though the coefficients for some levels of smoking_status and work_type have been set to 0, we decided to incorporate into the model any categorical attribute with at least one level whose coefficient is nonzero.
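The nonzero coefficients can also be extracted programmatically instead of being read off the printout (a sketch using the `lasso_fit` object fitted above):

```r
# Pull out the names of the predictors lasso kept (nonzero coefficients)
lasso_coefs <- coef(lasso_fit)
rownames(lasso_coefs)[which(lasso_coefs[, 1] != 0)]
```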
Backward Elimination Based on p-values:
Note that the `glm` calls below use the default gaussian family (a linear probability model) rather than the binomial family used elsewhere; the elimination order is driven by the resulting p-values.
back <- glm(stroke ~
gender +
age +
hypertension +
heart_disease +
work_type +
avg_glucose_level +
smoking_status +
ever_married +
Residence_type +
bmi, data = stroke)
summary(back)
##
## Call:
## glm(formula = stroke ~ gender + age + hypertension + heart_disease +
## work_type + avg_glucose_level + smoking_status + ever_married +
## Residence_type + bmi, data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.29253 -0.06862 -0.02089 0.00605 1.00712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.919e-02 1.589e-02 -1.837 0.066294 .
## genderMale -2.382e-03 5.701e-03 -0.418 0.676089
## genderOther -2.662e-02 1.942e-01 -0.137 0.890983
## age 2.632e-03 2.079e-04 12.661 < 2e-16 ***
## hypertension 4.857e-02 1.014e-02 4.788 1.74e-06 ***
## heart_disease 5.635e-02 1.344e-02 4.194 2.79e-05 ***
## work_typeGovt_job -5.954e-02 1.455e-02 -4.093 4.33e-05 ***
## work_typeNever_worked -2.561e-02 4.232e-02 -0.605 0.545171
## work_typePrivate -4.582e-02 1.221e-02 -3.752 0.000177 ***
## work_typeSelf-employed -6.541e-02 1.483e-02 -4.411 1.05e-05 ***
## avg_glucose_level 3.274e-04 6.561e-05 4.991 6.23e-07 ***
## smoking_statusnever smoked -2.048e-03 8.209e-03 -0.249 0.803014
## smoking_statussmokes 6.056e-03 9.906e-03 0.611 0.540975
## smoking_statusUnknown -6.041e-03 9.338e-03 -0.647 0.517685
## ever_marriedYes -2.898e-02 8.214e-03 -3.528 0.000422 ***
## Residence_typeUrban 1.642e-03 5.542e-03 0.296 0.767057
## bmi -6.135e-04 4.041e-04 -1.518 0.129006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03758521)
##
## Null deviance: 200.10 on 4908 degrees of freedom
## Residual deviance: 183.87 on 4892 degrees of freedom
## AIC: -2157
##
## Number of Fisher Scoring iterations: 2
# ELIMINATE GENDER FIRST
back <- glm(stroke ~
age +
hypertension +
heart_disease +
work_type +
avg_glucose_level +
smoking_status +
ever_married +
Residence_type +
bmi, data = stroke)
summary(back)
##
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type +
## avg_glucose_level + smoking_status + ever_married + Residence_type +
## bmi, data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.29342 -0.06853 -0.02088 0.00578 1.00807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.044e-02 1.562e-02 -1.949 0.051399 .
## age 2.633e-03 2.079e-04 12.666 < 2e-16 ***
## hypertension 4.849e-02 1.014e-02 4.782 1.79e-06 ***
## heart_disease 5.592e-02 1.339e-02 4.175 3.03e-05 ***
## work_typeGovt_job -5.922e-02 1.452e-02 -4.079 4.59e-05 ***
## work_typeNever_worked -2.564e-02 4.231e-02 -0.606 0.544604
## work_typePrivate -4.552e-02 1.218e-02 -3.738 0.000188 ***
## work_typeSelf-employed -6.505e-02 1.480e-02 -4.396 1.12e-05 ***
## avg_glucose_level 3.260e-04 6.552e-05 4.976 6.71e-07 ***
## smoking_statusnever smoked -1.785e-03 8.185e-03 -0.218 0.827377
## smoking_statussmokes 6.146e-03 9.900e-03 0.621 0.534750
## smoking_statusUnknown -5.924e-03 9.330e-03 -0.635 0.525508
## ever_marriedYes -2.897e-02 8.211e-03 -3.528 0.000423 ***
## Residence_typeUrban 1.665e-03 5.540e-03 0.300 0.763815
## bmi -6.137e-04 4.039e-04 -1.519 0.128712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03757132)
##
## Null deviance: 200.10 on 4908 degrees of freedom
## Residual deviance: 183.87 on 4894 degrees of freedom
## AIC: -2160.8
##
## Number of Fisher Scoring iterations: 2
# ELIMINATE SMOKING STATUS
back <- glm(stroke ~
age +
hypertension +
heart_disease +
work_type +
avg_glucose_level +
ever_married +
Residence_type +
bmi, data = stroke)
summary(back)
##
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type +
## avg_glucose_level + ever_married + Residence_type + bmi,
## data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.29500 -0.06793 -0.02047 0.00544 1.01518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.635e-02 1.245e-02 -2.919 0.003532 **
## age 2.630e-03 2.060e-04 12.767 < 2e-16 ***
## hypertension 4.901e-02 1.011e-02 4.846 1.30e-06 ***
## heart_disease 5.669e-02 1.337e-02 4.239 2.28e-05 ***
## work_typeGovt_job -5.514e-02 1.381e-02 -3.992 6.65e-05 ***
## work_typeNever_worked -2.357e-02 4.211e-02 -0.560 0.575644
## work_typePrivate -4.154e-02 1.139e-02 -3.648 0.000268 ***
## work_typeSelf-employed -6.120e-02 1.417e-02 -4.321 1.59e-05 ***
## avg_glucose_level 3.270e-04 6.548e-05 4.994 6.13e-07 ***
## ever_marriedYes -2.842e-02 8.198e-03 -3.467 0.000531 ***
## Residence_typeUrban 1.872e-03 5.536e-03 0.338 0.735268
## bmi -5.998e-04 4.036e-04 -1.486 0.137309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03756085)
##
## Null deviance: 200.10 on 4908 degrees of freedom
## Residual deviance: 183.94 on 4897 degrees of freedom
## AIC: -2165.2
##
## Number of Fisher Scoring iterations: 2
# ELIMINATE RESIDENCE TYPE
back <- glm(stroke ~
age +
hypertension +
heart_disease +
work_type +
avg_glucose_level +
ever_married +
bmi, data = stroke)
summary(back)
##
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type +
## avg_glucose_level + ever_married + bmi, data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.29591 -0.06812 -0.02022 0.00539 1.01425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.539e-02 1.213e-02 -2.918 0.003535 **
## age 2.631e-03 2.060e-04 12.774 < 2e-16 ***
## hypertension 4.900e-02 1.011e-02 4.845 1.31e-06 ***
## heart_disease 5.666e-02 1.337e-02 4.238 2.30e-05 ***
## work_typeGovt_job -5.514e-02 1.381e-02 -3.992 6.64e-05 ***
## work_typeNever_worked -2.325e-02 4.209e-02 -0.552 0.580791
## work_typePrivate -4.157e-02 1.139e-02 -3.651 0.000264 ***
## work_typeSelf-employed -6.121e-02 1.416e-02 -4.321 1.58e-05 ***
## avg_glucose_level 3.268e-04 6.547e-05 4.991 6.21e-07 ***
## ever_marriedYes -2.842e-02 8.197e-03 -3.467 0.000530 ***
## bmi -5.998e-04 4.036e-04 -1.486 0.137320
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03755406)
##
## Null deviance: 200.10 on 4908 degrees of freedom
## Residual deviance: 183.94 on 4898 degrees of freedom
## AIC: -2167.1
##
## Number of Fisher Scoring iterations: 2
# ELIMINATE BMI
back <- glm(stroke ~
age +
hypertension +
heart_disease +
work_type +
avg_glucose_level +
ever_married,
data = stroke)
summary(back)
##
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type +
## avg_glucose_level + ever_married, data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.29470 -0.06823 -0.02127 0.00636 1.01452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.633e-02 9.638e-03 -4.807 1.58e-06 ***
## age 2.640e-03 2.059e-04 12.823 < 2e-16 ***
## hypertension 4.748e-02 1.006e-02 4.718 2.45e-06 ***
## heart_disease 5.717e-02 1.337e-02 4.277 1.93e-05 ***
## work_typeGovt_job -6.046e-02 1.334e-02 -4.532 5.98e-06 ***
## work_typeNever_worked -2.661e-02 4.204e-02 -0.633 0.526742
## work_typePrivate -4.690e-02 1.081e-02 -4.338 1.47e-05 ***
## work_typeSelf-employed -6.624e-02 1.375e-02 -4.816 1.51e-06 ***
## avg_glucose_level 3.146e-04 6.496e-05 4.843 1.32e-06 ***
## ever_marriedYes -2.973e-02 8.151e-03 -3.648 0.000268 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03756333)
##
## Null deviance: 200.10 on 4908 degrees of freedom
## Residual deviance: 184.02 on 4899 degrees of freedom
## AIC: -2166.9
##
## Number of Fisher Scoring iterations: 2
The model discovered by backward elimination based on p-values contains:
age
hypertension
heart_disease
work_type
avg_glucose_level
ever_married
These results largely agree with the lasso selection; the only difference is that backward elimination retains ever_married while lasso retained smoking_status.
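One way to adjudicate between the two candidate variable sets is to fit both as logistic regressions and compare AIC (a sketch; lower AIC indicates a better trade-off between fit and complexity):

```r
# Lasso kept smoking_status; backward elimination kept ever_married instead
fit_lasso_vars <- glm(stroke ~ age + hypertension + heart_disease + work_type +
                        avg_glucose_level + smoking_status,
                      data = stroke, family = binomial(link = "logit"))
fit_back_vars <- glm(stroke ~ age + hypertension + heart_disease + work_type +
                       avg_glucose_level + ever_married,
                     data = stroke, family = binomial(link = "logit"))
AIC(fit_lasso_vars, fit_back_vars)
```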
Forward Selection based on p-values
# STEPWISE REGRESSION: FORWARD, BASED ON P-VALUE
summary(glm(stroke ~ gender, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.14163474 0.09323859 -33.69457684 6.940867e-249
## genderMale 0.06914952 0.14300238 0.48355506 6.287017e-01
## genderOther -9.42442826 324.74370946 -0.02902113 9.768477e-01
summary(glm(stroke ~ age, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.37723102 0.362382724 -20.35757 3.978460e-92
## age 0.07496941 0.005317943 14.09745 3.937696e-45
summary(glm(stroke ~ hypertension, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.364515 0.08332751 -40.377001 0.000000e+00
## hypertension 1.490152 0.16176429 9.211872 3.204876e-20
summary(glm(stroke ~ heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.281267 0.07835514 -41.876860 0.000000e+00
## heart_disease 1.656941 0.18990955 8.724893 2.664315e-18
summary(glm(stroke ~ ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.292245 0.2099351 -20.445578 6.577429e-93
## ever_marriedYes 1.505642 0.2231153 6.748267 1.496213e-11
summary(glm(stroke ~ work_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.507278 1.000746 -6.50242696 7.903440e-11
## work_typeGovt_job 3.439225 1.019249 3.37427432 7.401057e-04
## work_typeNever_worked -9.058791 310.293410 -0.02919427 9.767096e-01
## work_typePrivate 3.456401 1.004858 3.43969208 5.823764e-04
## work_typeSelf-employed 3.895544 1.010814 3.85386769 1.162664e-04
summary(glm(stroke ~ Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.14372115 0.1021333 -30.7805572 4.771318e-208
## Residence_typeUrban 0.05979319 0.1415116 0.4225322 6.726366e-01
summary(glm(stroke ~ avg_glucose_level, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.43638483 0.175678799 -25.252819 1.054641e-140
## avg_glucose_level 0.01124957 0.001216871 9.244668 2.359720e-20
summary(glm(stroke ~ bmi, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.82844339 0.256842054 -14.905828 3.020465e-50
## bmi 0.02415957 0.008128831 2.972084 2.957860e-03
summary(glm(stroke ~ smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6162427 0.1372075 -19.067775 4.677881e-81
## smoking_statusnever smoked -0.4305448 0.1769076 -2.433728 1.494424e-02
## smoking_statussmokes -0.2684148 0.2142419 -1.252858 2.102572e-01
## smoking_statusUnknown -1.2985352 0.2323605 -5.588451 2.291035e-08
# AGE HAS THE LOWEST P-VALUE, SO WE WILL CHOOSE THAT
summary(glm(stroke ~ age + gender, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.42002030 3.692914e-01 -20.09258722 8.567908e-90
## age 0.07500715 5.325108e-03 14.08556251 4.659339e-45
## genderMale 0.09702711 1.491261e-01 0.65063812 5.152801e-01
## genderOther -7.09622848 3.247438e+02 -0.02185178 9.825662e-01
summary(glm(stroke ~ age + hypertension, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.31855136 0.363897407 -20.111579 5.843407e-90
## age 0.07191301 0.005407977 13.297582 2.390765e-40
## hypertension 0.64765101 0.169993562 3.809856 1.390477e-04
summary(glm(stroke ~ age + heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.26864292 0.362784710 -20.035693 2.690704e-89
## age 0.07217707 0.005417203 13.323678 1.685920e-40
## heart_disease 0.51464561 0.200561564 2.566023 1.028720e-02
summary(glm(stroke ~ age + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.32746499 0.392838810 -18.6525995 1.202843e-77
## age 0.07523089 0.005361696 14.0311745 1.004737e-44
## ever_marriedYes -0.07526895 0.238788309 -0.3152121 7.526007e-01
summary(glm(stroke ~ age + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.11308109 1.002052e+00 -7.09851313 1.261062e-12
## age 0.07870812 5.770047e-03 13.64080999 2.290028e-42
## work_typeGovt_job -0.54620694 1.070862e+00 -0.51006302 6.100073e-01
## work_typeNever_worked -9.73962475 3.096518e+02 -0.03145348 9.749079e-01
## work_typePrivate -0.36740304 1.057141e+00 -0.34754405 7.281826e-01
## work_typeSelf-employed -0.80286135 1.077045e+00 -0.74542952 4.560121e-01
summary(glm(stroke ~ age + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.38289763 0.368617390 -20.02862001 3.101334e-89
## age 0.07495791 0.005319159 14.09206085 4.249762e-45
## Residence_typeUrban 0.01241056 0.147547409 0.08411238 9.329671e-01
summary(glm(stroke ~ age + avg_glucose_level, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.863851504 0.383221332 -20.520391 1.415697e-93
## age 0.071836016 0.005423383 13.245611 4.783731e-40
## avg_glucose_level 0.005596653 0.001228612 4.555264 5.231984e-06
summary(glm(stroke ~ age + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.94351988 0.542971877 -14.629708 1.815653e-48
## age 0.07609125 0.005469275 13.912493 5.319231e-44
## bmi 0.01624901 0.011122646 1.460895 1.440443e-01
summary(glm(stroke ~ age + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.39162386 0.403044144 -18.3394895 4.005406e-75
## age 0.07583030 0.005475945 13.8478947 1.309954e-43
## smoking_statusnever smoked -0.07242036 0.183842287 -0.3939266 6.936352e-01
## smoking_statussmokes 0.30663330 0.225449316 1.3600986 1.737987e-01
## smoking_statusUnknown -0.38304204 0.241441298 -1.5864810 1.126302e-01
# AVG_GLUCOSE_LEVEL HAS THE LOWEST P-VALUE, SO WE WILL CHOOSE THAT
summary(glm(stroke ~ age + avg_glucose_level + gender, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.877596737 3.873232e-01 -20.3385629 5.862316e-92
## age 0.071852329 5.426958e-03 13.2398893 5.162488e-40
## avg_glucose_level 0.005567628 1.233785e-03 4.5126402 6.402558e-06
## genderMale 0.039349196 1.506131e-01 0.2612600 7.938920e-01
## genderOther -7.354634943 3.247438e+02 -0.0226475 9.819315e-01
summary(glm(stroke ~ age + avg_glucose_level + hypertension, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.766874299 0.385207756 -20.162819 2.077042e-90
## age 0.069570876 0.005489880 12.672569 8.392051e-37
## avg_glucose_level 0.005046714 0.001245508 4.051933 5.079616e-05
## hypertension 0.547025603 0.172630396 3.168768 1.530868e-03
summary(glm(stroke ~ age + avg_glucose_level + heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.751878533 0.385287556 -20.119722 4.958501e-90
## age 0.069711782 0.005507556 12.657479 1.017131e-36
## avg_glucose_level 0.005328059 0.001238933 4.300522 1.703964e-05
## heart_disease 0.418616918 0.202687145 2.065335 3.889128e-02
summary(glm(stroke ~ age + avg_glucose_level + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.789594046 0.408601787 -19.0640235 5.025727e-81
## age 0.072241969 0.005453879 13.2459789 4.760343e-40
## avg_glucose_level 0.005623909 0.001230790 4.5693476 4.892447e-06
## ever_marriedYes -0.117295923 0.240601219 -0.4875118 6.258957e-01
summary(glm(stroke ~ age + avg_glucose_level + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.619543226 1.008696e+00 -7.5538521 4.225702e-14
## age 0.075457883 5.886560e-03 12.8186715 1.288899e-37
## avg_glucose_level 0.005533452 1.230581e-03 4.4966153 6.904374e-06
## work_typeGovt_job -0.514593970 1.072221e+00 -0.4799325 6.312754e-01
## work_typeNever_worked -9.718833702 3.092018e+02 -0.0314320 9.749250e-01
## work_typePrivate -0.336037018 1.058422e+00 -0.3174888 7.508728e-01
## work_typeSelf-employed -0.756468281 1.078583e+00 -0.7013539 4.830822e-01
summary(glm(stroke ~ age + avg_glucose_level + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.869477626 0.388986873 -20.23070232 5.254916e-91
## age 0.071822532 0.005425147 13.23881776 5.236665e-40
## avg_glucose_level 0.005596978 0.001228693 4.55523058 5.232815e-06
## Residence_typeUrban 0.012527162 0.148357406 0.08443907 9.327073e-01
summary(glm(stroke ~ age + avg_glucose_level + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.043403036 0.538823904 -14.9277027 2.176357e-50
## age 0.072270558 0.005531337 13.0656581 5.173525e-39
## avg_glucose_level 0.005454762 0.001262780 4.3196447 1.562806e-05
## bmi 0.005563145 0.011557519 0.4813442 6.302719e-01
summary(glm(stroke ~ age + avg_glucose_level + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.903594326 0.42602447 -18.5519727 7.861314e-77
## age 0.072903595 0.00557849 13.0686971 4.970964e-39
## avg_glucose_level 0.005478817 0.00123227 4.4461167 8.743648e-06
## smoking_statusnever smoked -0.054455363 0.18515721 -0.2941034 7.686789e-01
## smoking_statussmokes 0.341884635 0.22698826 1.5061776 1.320216e-01
## smoking_statusUnknown -0.303630346 0.24332968 -1.2478147 2.120989e-01
# HYPERTENSION HAS THE LOWEST P-VALUE, SO WE WILL CHOOSE THAT
summary(glm(stroke ~ age + hypertension + avg_glucose_level + gender, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.781770240 3.895709e-01 -19.97523508 9.045703e-89
## age 0.069595683 5.493776e-03 12.66809512 8.884573e-37
## hypertension 0.547147323 1.726137e-01 3.16977869 1.525551e-03
## avg_glucose_level 0.005015926 1.250812e-03 4.01013607 6.068377e-05
## genderMale 0.041195282 1.510594e-01 0.27270914 7.850768e-01
## genderOther -7.312713174 3.247438e+02 -0.02251841 9.820344e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.660740480 0.387152022 -19.787422 3.820918e-87
## age 0.067546641 0.005571399 12.123820 7.899008e-34
## hypertension 0.539613446 0.173054857 3.118164 1.819814e-03
## avg_glucose_level 0.004802004 0.001254526 3.827743 1.293239e-04
## heart_disease 0.404298420 0.203446548 1.987246 4.689510e-02
summary(glm(stroke ~ age + hypertension + avg_glucose_level + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.697702775 0.410321307 -18.7601829 1.598484e-78
## age 0.069957658 0.005523751 12.6648831 9.255812e-37
## hypertension 0.546450560 0.172713128 3.1639202 1.556595e-03
## avg_glucose_level 0.005074374 0.001247718 4.0669244 4.763768e-05
## ever_marriedYes -0.110130436 0.241004702 -0.4569638 6.476970e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.546497794 1.008936e+00 -7.47966058 7.451481e-14
## age 0.073295078 5.957261e-03 12.30348668 8.674707e-35
## hypertension 0.559604419 1.729560e-01 3.23553143 1.214165e-03
## avg_glucose_level 0.004977561 1.247178e-03 3.99105953 6.577876e-05
## work_typeGovt_job -0.485869226 1.072735e+00 -0.45292555 6.506024e-01
## work_typeNever_worked -9.701252883 3.093385e+02 -0.03136128 9.749814e-01
## work_typePrivate -0.317674408 1.059079e+00 -0.29995343 7.642127e-01
## work_typeSelf-employed -0.755732210 1.079514e+00 -0.70006707 4.838854e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.775463584 0.391047699 -19.8836705 5.635628e-88
## age 0.069551061 0.005491172 12.6659771 9.127677e-37
## hypertension 0.547365843 0.172654226 3.1703009 1.522812e-03
## avg_glucose_level 0.005046535 0.001245634 4.0513801 5.091642e-05
## Residence_typeUrban 0.019029741 0.148805758 0.1278831 8.982415e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.852290780 0.541579006 -14.4988833 1.231369e-47
## age 0.069793224 0.005592872 12.4789605 9.724796e-36
## hypertension 0.543398746 0.173304241 3.1355190 1.715503e-03
## avg_glucose_level 0.004983550 0.001276095 3.9053114 9.410412e-05
## bmi 0.002620905 0.011597749 0.2259839 8.212139e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.817975694 0.428113081 -18.2614735 1.677036e-74
## age 0.070743756 0.005641453 12.5399888 4.510592e-36
## hypertension 0.531583861 0.173219210 3.0688505 2.148841e-03
## avg_glucose_level 0.004991703 0.001246534 4.0044672 6.215741e-05
## smoking_statusnever smoked -0.073695985 0.185830434 -0.3965765 6.916798e-01
## smoking_statussmokes 0.342527512 0.227521618 1.5054724 1.322027e-01
## smoking_statusUnknown -0.259185034 0.244351365 -1.0607063 2.888234e-01
# HEART DISEASE HAS THE LOWEST P-VALUE THAT IS STILL BELOW 0.05, SO WE ADD IT NEXT
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + gender, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.663276556 3.923369e-01 -19.53238744 5.825145e-85
## age 0.067550861 5.577181e-03 12.11200717 9.123390e-34
## hypertension 0.539631103 1.730532e-01 3.11829580 1.819001e-03
## avg_glucose_level 0.004797906 1.258026e-03 3.81383810 1.368251e-04
## heart_disease 0.403109319 2.050314e-01 1.96608582 4.928870e-02
## genderMale 0.007158815 1.524657e-01 0.04695361 9.625502e-01
## genderOther -7.346792655 3.247438e+02 -0.02262335 9.819507e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + ever_married, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.605991130 0.411245668 -18.4950061 2.265166e-76
## age 0.067895001 0.005623411 12.0736330 1.455631e-33
## hypertension 0.539081598 0.173124647 3.1138351 1.846727e-03
## avg_glucose_level 0.004825268 0.001256749 3.8394829 1.232937e-04
## heart_disease 0.400997394 0.203702340 1.9685458 4.900527e-02
## ever_marriedYes -0.089925431 0.241799551 -0.3719007 7.099668e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + work_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.505346860 1.009144e+00 -7.43734348 1.027302e-13
## age 0.071219249 6.056140e-03 11.75984242 6.285149e-32
## hypertension 0.550727122 1.733724e-01 3.17655638 1.490348e-03
## avg_glucose_level 0.004732252 1.256763e-03 3.76542799 1.662640e-04
## heart_disease 0.386448475 2.043436e-01 1.89116977 5.860169e-02
## work_typeGovt_job -0.405739320 1.073300e+00 -0.37802974 7.054085e-01
## work_typeNever_worked -9.683999384 3.094148e+02 -0.03129779 9.750321e-01
## work_typePrivate -0.250174174 1.059412e+00 -0.23614443 8.133206e-01
## work_typeSelf-employed -0.677872289 1.080087e+00 -0.62760916 5.302600e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + Residence_type, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.673254253 0.392897024 -19.5299373 6.111441e-85
## age 0.067515342 0.005572265 12.1163186 8.656074e-34
## hypertension 0.540063451 0.173077072 3.1203639 1.806277e-03
## avg_glucose_level 0.004800663 0.001254750 3.8259904 1.302474e-04
## heart_disease 0.405522896 0.203547641 1.9922751 4.634088e-02
## Residence_typeUrban 0.027945581 0.149125639 0.1873962 8.513500e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + bmi, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.776192003 0.543266749 -14.3137639 1.795278e-46
## age 0.067831111 0.005669895 11.9633815 5.526470e-33
## hypertension 0.534552075 0.173750262 3.0765541 2.094083e-03
## avg_glucose_level 0.004716290 0.001284854 3.6706818 2.419044e-04
## heart_disease 0.406811554 0.203530651 1.9987729 4.563293e-02
## bmi 0.003567092 0.011661870 0.3058765 7.596987e-01
summary(glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease + smoking_status, data = stroke, family = binomial(link = "logit")))$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.715380015 0.430076067 -17.9395707 5.790067e-72
## age 0.068759594 0.005730639 11.9985921 3.613914e-33
## hypertension 0.522593784 0.173695992 3.0086692 2.623946e-03
## avg_glucose_level 0.004772674 0.001255004 3.8029161 1.430028e-04
## heart_disease 0.368800613 0.204657023 1.8020423 7.153875e-02
## smoking_statusnever smoked -0.056893666 0.186363000 -0.3052841 7.601498e-01
## smoking_statussmokes 0.320623702 0.228448635 1.4034827 1.604730e-01
## smoking_statusUnknown -0.257303488 0.244818981 -1.0509948 2.932610e-01
# NO OTHER PARAMETERS WITH P-VALUE LESS THAN 0.05. WE WILL STOP HERE.
The final model obtained by forward selection based on p-values contains the following attributes:
age
hypertension
avg_glucose_level
heart_disease
Since lasso, backward elimination and forward selection all yield different models, for the classification part we will stick to the results obtained via:
Lasso regression (more reliable)
Forward selection based on p-values (yields a minimal model)
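The forward-selection procedure above was carried out by hand, refitting the model once per candidate at every step. As a hedged sketch, the same idea can be automated; the function name `forward_select` and the illustration on R's built-in `mtcars` data (binary outcome `am`) are our own choices, not part of the analysis above:

```r
# Hypothetical sketch of forward selection by p-value: at each step, add the
# candidate whose coefficient has the smallest p-value, stopping when no
# remaining candidate is significant at alpha.
forward_select <- function(data, outcome, candidates, alpha = 0.05) {
  chosen <- c()
  repeat {
    remaining <- setdiff(candidates, chosen)
    if (length(remaining) == 0) break
    pvals <- sapply(remaining, function(v) {
      f <- reformulate(c(chosen, v), response = outcome)
      coefs <- summary(glm(f, data = data, family = binomial))$coefficients
      # p-value(s) of the newly added variable (dummy rows match by name)
      min(coefs[grep(v, rownames(coefs), fixed = TRUE), 4])
    })
    if (min(pvals) >= alpha) break   # no candidate is significant: stop
    chosen <- c(chosen, names(which.min(pvals)))
  }
  chosen
}

forward_select(mtcars, "am", c("mpg", "wt", "hp"))
```

Applied to the stroke data with all ten candidate predictors, this loop would reproduce the step-by-step refits shown above.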
Predict or classify one variable in your dataset from other variables. Motivate why you choose the outcome and its predictors. Evaluate the performance of your predictive model or the resulting classifier using appropriate metrics and visualizations. Remark: While the other parts of your presentation should nicely fit together as one whole, this prediction part might not fit as nicely, depending on your research question or data narrative. That is okay.
We have decided to implement 3 classifiers: logistic regression, decision trees and Naive Bayes. As we previously said, we will be using the attributes discovered from both lasso regression and forward selection, so we will end up doing 2 different logistic regressions, 2 different trees and 2 different Naive Bayes classifiers.
Logistic Regression
# LOGISTIC REGRESSION 1: BASED ON PARAMETERS DISCOVERED BY LASSO
#age,
#hypertension,
#heart_disease,
#work_type,
#avg_glucose_level and
#smoking_status
log_reg1 <- glm(stroke ~ age + hypertension + heart_disease + work_type + avg_glucose_level + smoking_status,
data = stroke,
family = binomial(link = "logit"))
summary(log_reg1)
##
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + work_type +
## avg_glucose_level + smoking_status, family = binomial(link = "logit"),
## data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1445 -0.2935 -0.1533 -0.0732 3.5399
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.283429 1.034054 -7.044 1.87e-12 ***
## age 0.072774 0.006186 11.765 < 2e-16 ***
## hypertension 0.531510 0.174168 3.052 0.002275 **
## heart_disease 0.348127 0.205533 1.694 0.090309 .
## work_typeGovt_job -0.704910 1.086388 -0.649 0.516431
## work_typeNever_worked -9.799263 308.993942 -0.032 0.974701
## work_typePrivate -0.546146 1.071986 -0.509 0.610422
## work_typeSelf-employed -0.969312 1.092229 -0.887 0.374830
## avg_glucose_level 0.004713 0.001257 3.748 0.000178 ***
## smoking_statusnever smoked -0.062403 0.186637 -0.334 0.738111
## smoking_statussmokes 0.314113 0.229234 1.370 0.170602
## smoking_statusUnknown -0.272125 0.246510 -1.104 0.269632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1728.4 on 4908 degrees of freedom
## Residual deviance: 1363.6 on 4897 degrees of freedom
## AIC: 1387.6
##
## Number of Fisher Scoring iterations: 14
library(ROCR)
pred = predict(log_reg1, type="response")
predObj = prediction(pred, stroke$stroke)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:", round(aucObj@y.values[[1]] ,4)))
# Dichotomise the predicted probabilities at a 0.5 cut-off
# (vectorised; equivalent to looping over all 4909 rows)
pred2 <- ifelse(pred > 0.5, 1, 0)
pred2 <- factor(pred2)
library(caret)
## Loading required package: lattice
confusionMatrix(pred2, factor(stroke$stroke))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4700 208
## 1 0 1
##
## Accuracy : 0.9576
## 95% CI : (0.9516, 0.9631)
## No Information Rate : 0.9574
## P-Value [Acc > NIR] : 0.4902
##
## Kappa : 0.0091
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.000000
## Specificity : 0.004785
## Pos Pred Value : 0.957620
## Neg Pred Value : 1.000000
## Prevalence : 0.957425
## Detection Rate : 0.957425
## Detection Prevalence : 0.999796
## Balanced Accuracy : 0.502392
##
## 'Positive' Class : 0
##
## NOTE: The confusionMatrix function in the caret package uses other formulas than the ones in the textbook for PPV, NPV, Prevalence, etc. The only metrics consistent with the textbook are Sensitivity and Specificity.
## If one desires, the following website can be consulted for the exact formulas for the metrics that this function uses: https://www.rdocumentation.org/packages/caret/versions/6.0-86/topics/confusionMatrix
We have set the threshold at 0.5 when deciding whether someone is classified as having had a stroke. The accuracy of this model is very high at more than 95%, the sensitivity is 100%, the positive predictive value also exceeds 95%, and the area under the curve is 0.8526. Nevertheless, these results are misleading and should not be taken at face value, because the dataset is heavily imbalanced: only 209 of the 4909 patients (about 4%) have had a stroke. The confusion matrix shows that the model classified essentially every individual as not having had a stroke, so the high accuracy merely reflects the share of the majority class, while nearly all of the actual stroke cases were misclassified as 0. This is very problematic.
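The imbalance problem can be made concrete with a few lines of arithmetic. The cell counts below are taken from the confusion matrix above, but here we treat stroke = 1 as the positive class (the opposite of caret's default):

```r
# Cells of the confusion matrix above, with stroke = 1 as the positive class
tn <- 4700; fn <- 208; fp <- 0; tp <- 1
n  <- tp + tn + fp + fn

accuracy       <- (tp + tn) / n       # overall accuracy
majority_share <- (tn + fp) / n       # fraction of non-stroke patients
stroke_recall  <- tp / (tp + fn)      # sensitivity for actual strokes

round(c(accuracy = accuracy,
        majority_share = majority_share,
        stroke_recall = stroke_recall), 4)
#   accuracy majority_share  stroke_recall
#     0.9576         0.9574         0.0048
```

The accuracy (0.9576) barely exceeds what a model predicting "no stroke" for everyone would achieve (0.9574), while the recall for actual strokes is below half a percent.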
We encounter the same problem with the second logistic model, based on the parameters discovered by forward selection:
# LOGISTIC REGRESSION 2: BASED ON PARAMETERS DISCOVERED BY STEPWISE REGRESSION: FORWARD
log_reg2 <- glm(stroke ~ age + hypertension + avg_glucose_level + heart_disease,
data = stroke,
family = binomial(link = "logit"))
summary(log_reg2)
##
## Call:
## glm(formula = stroke ~ age + hypertension + avg_glucose_level +
## heart_disease, family = binomial(link = "logit"), data = stroke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0995 -0.2940 -0.1599 -0.0778 3.5885
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.660740 0.387152 -19.787 < 2e-16 ***
## age 0.067547 0.005571 12.124 < 2e-16 ***
## hypertension 0.539613 0.173055 3.118 0.001820 **
## avg_glucose_level 0.004802 0.001255 3.828 0.000129 ***
## heart_disease 0.404298 0.203447 1.987 0.046895 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1728.4 on 4908 degrees of freedom
## Residual deviance: 1374.6 on 4904 degrees of freedom
## AIC: 1384.6
##
## Number of Fisher Scoring iterations: 7
library(ROCR)
pred = predict(log_reg2, type="response")
predObj = prediction(pred, stroke$stroke)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:", round(aucObj@y.values[[1]] ,4)))
# Dichotomise the predicted probabilities at a 0.5 cut-off
# (vectorised; equivalent to looping over all 4909 rows)
pred2 <- ifelse(pred > 0.5, 1, 0)
pred2 <- factor(pred2)
library(caret)
confusionMatrix(pred2,factor(stroke$stroke))
## Warning in confusionMatrix.default(pred2, factor(stroke$stroke)): Levels are not
## in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4700 209
## 1 0 0
##
## Accuracy : 0.9574
## 95% CI : (0.9514, 0.9629)
## No Information Rate : 0.9574
## P-Value [Acc > NIR] : 0.5184
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9574
## Neg Pred Value : NaN
## Prevalence : 0.9574
## Detection Rate : 0.9574
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
The AUC is a bit smaller, but the accuracy, sensitivity and positive predictive value are very similar. Again, these results should not be taken at face value, because the model classified every single data point as 0.
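One common remedy, which we did not pursue here, is to lower the 0.5 cut-off so that the rare positive class is not drowned out. The following self-contained sketch uses synthetic data (not our stroke data) to show the trade-off: a lower threshold buys sensitivity for the rare class at the cost of specificity.

```r
set.seed(1)
# Synthetic illustration: ~5% positives, with a noisy probability score
y     <- rbinom(2000, 1, 0.05)
score <- plogis(-3 + 2 * y + rnorm(2000))  # positives tend to score higher

sens <- function(thr) mean(score[y == 1] > thr)   # recall for the rare class
spec <- function(thr) mean(score[y == 0] <= thr)

round(c(sens_0.50 = sens(0.5), spec_0.50 = spec(0.5),
        sens_0.05 = sens(0.05), spec_0.05 = spec(0.05)), 3)
```

At a 0.5 threshold almost no positives are caught; at 0.05 most are, while specificity drops. Choosing the threshold (for instance at the class prevalence, or via the ROC curve plotted above) is a judgment call that depends on the relative cost of missing a stroke versus a false alarm.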
Decision Tree
## DECISION TREE 1: USING THE PARAMETERS DISCOVERED BY LASSO
library(rpart)
library(rpart.plot)
fit <- rpart(stroke ~ age + hypertension + heart_disease + work_type + avg_glucose_level + smoking_status,
method ="class",
data = stroke,
control=rpart.control(maxdepth=6,cp=0.001),
parms=list(split='information'))
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=0)
pred <-predict(object = fit, stroke, type = "class")
t <- table(pred, stroke$stroke)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 4693 197
## 1 7 12
##
## Accuracy : 0.9584
## 95% CI : (0.9525, 0.9639)
## No Information Rate : 0.9574
## P-Value [Acc > NIR] : 0.3789
##
## Kappa : 0.0989
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99851
## Specificity : 0.05742
## Pos Pred Value : 0.95971
## Neg Pred Value : 0.63158
## Prevalence : 0.95743
## Detection Rate : 0.95600
## Detection Prevalence : 0.99613
## Balanced Accuracy : 0.52796
##
## 'Positive' Class : 0
##
For this decision tree, the number of false positives is very low at 7, and the number of true positives is 12. Overall, though, the model is only moderately good, because the number of false negatives remains high at 197: it still misses the vast majority of actual strokes.
In order to be classified as a positive, a person must be either:
older than 54 years, but younger than 65, with an average glucose level of 104 or more, currently a smoker, having heart disease or
older than 68, with an average glucose level of 231 or more, working for the government or at a private company, and having hypertension
The positive predictive value is 95.9% (quite high) and the detection rate is 95.6%.
NOTE : We have limited the depth of the tree to 6 to prevent overfitting. We also set the cp parameter to 0.001 so that any split must decrease the overall lack of fit by a factor of 0.001 before being attempted.
## DECISION TREE 2: USING THE PARAMETERS DISCOVERED BY FORWARD SELECTION BASED ON P-VALUE
library(rpart)
library(rpart.plot)
fit2 <- rpart(stroke ~ age + hypertension + heart_disease+ avg_glucose_level,
method ="class",
data = stroke,
control=rpart.control(maxdepth=5,cp=0.001),
parms=list(split='information'))
rpart.plot(fit2, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=0)
pred <-predict(object = fit2, stroke, type = "class")
t <- table(pred, stroke$stroke)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 4697 205
## 1 3 4
##
## Accuracy : 0.9576
## 95% CI : (0.9516, 0.9631)
## No Information Rate : 0.9574
## P-Value [Acc > NIR] : 0.4902
##
## Kappa : 0.0344
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99936
## Specificity : 0.01914
## Pos Pred Value : 0.95818
## Neg Pred Value : 0.57143
## Prevalence : 0.95743
## Detection Rate : 0.95681
## Detection Prevalence : 0.99857
## Balanced Accuracy : 0.50925
##
## 'Positive' Class : 0
##
For the second decision tree, the situation is similar: the number of false positives is 3 and the number of true positives is only 4. Again, the model is at best moderately good, because the number of false negatives is high at 205.
In order to be classified as a positive, a person must be older than 68, with an average glucose level between 127 and 132.
The positive predictive value is 95.8% (quite high) and the detection rate is 95.6%.
NOTE : We have limited the depth of the tree to 5 to prevent overfitting. We also set the cp parameter to 0.001 so that any split must decrease the overall lack of fit by a factor of 0.001 before being attempted.
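Both `maxdepth` and `cp` were picked by hand above. A hedged alternative would be to grow a deeper tree and prune it back using the cross-validated complexity table that rpart computes automatically. The sketch below uses rpart's built-in `kyphosis` data, not our stroke data:

```r
library(rpart)

# Grow a deliberately large tree, then prune at the cp value with the
# lowest cross-validated error (the xerror column of the cp table).
big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0.001))
printcp(big)  # CP table: one row per candidate pruning level

best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)
```

Applied to the stroke data, this would let the cross-validation choose the tree size instead of the fixed `maxdepth` of 5 or 6 used above.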
Naive Bayes
Our next option was a Naive Bayes classifier, in the hope of obtaining more reliable results:
library(e1071)
naive1 <- naiveBayes(stroke ~ age + hypertension + heart_disease + work_type + avg_glucose_level + smoking_status,
data = stroke,
laplace = 0.01,
na.action = na.pass)
# NOTE: We CAN plug in numerical variables in the Naive Bayes classifier, not only categorical ones.
# From the documentation, the naiveBayes function models each numerical variable as Gaussian within each class, using its per-class mean and standard deviation to compute probability scores.
library(caTools)
pred <-predict(object = naive1, stroke, type = "class")
t <- table(pred, stroke$stroke)
library(caret,quietly = TRUE)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 4214 121
## 1 486 88
##
## Accuracy : 0.8763
## 95% CI : (0.8668, 0.8854)
## No Information Rate : 0.9574
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1732
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8966
## Specificity : 0.4211
## Pos Pred Value : 0.9721
## Neg Pred Value : 0.1533
## Prevalence : 0.9574
## Detection Rate : 0.8584
## Detection Prevalence : 0.8831
## Balanced Accuracy : 0.6588
##
## 'Positive' Class : 0
##
The results from the first Naive Bayes classifier are indeed a lot more informative. The model did classify some instances as 1, with 88 true positives against 121 false negatives (at the cost of 486 false positives). The accuracy is high at 87.6%, as is the positive predictive value at 97%, and the detection rate is 85%. All in all, this is a much better classifier than logistic regression for this imbalanced dataset.
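To see in miniature what `naiveBayes` does with a numeric predictor such as age, we can reproduce the Gaussian scoring by hand. The toy ages, class prior of 5%, and query age of 72 below are invented for illustration only:

```r
# Toy per-class age samples (hypothetical, not from the stroke dataset)
age_stroke0 <- c(40, 45, 50, 55, 60)   # patients without stroke
age_stroke1 <- c(65, 70, 75, 80, 85)   # patients with stroke

# Gaussian likelihood of observing age 72 under each class,
# using the per-class mean and standard deviation (as naiveBayes does)
lik0 <- dnorm(72, mean(age_stroke0), sd(age_stroke0))
lik1 <- dnorm(72, mean(age_stroke1), sd(age_stroke1))

# Combine with an assumed 5% prior for stroke via Bayes' rule
prior1 <- 0.05
post1  <- prior1 * lik1 / (prior1 * lik1 + (1 - prior1) * lik0)
round(post1, 2)
```

Even with a strong prior toward class 0, an age well inside the stroke group's distribution can tip the posterior above 0.5, which is why this classifier, unlike the thresholded logistic regression above, actually predicts some positives.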
We also implemented another Naive Bayes classifier based on the minimal model discovered through forward selection based on p-values:
naive2 <- naiveBayes(stroke ~ age + hypertension + avg_glucose_level + heart_disease,
data = stroke,
laplace = 0.01,
na.action = na.pass)
pred <-predict(object = naive2, stroke, type = "class")
t <- table(pred, stroke$stroke)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 4255 124
## 1 445 85
##
## Accuracy : 0.8841
## 95% CI : (0.8748, 0.8929)
## No Information Rate : 0.9574
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.18
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9053
## Specificity : 0.4067
## Pos Pred Value : 0.9717
## Neg Pred Value : 0.1604
## Prevalence : 0.9574
## Detection Rate : 0.8668
## Detection Prevalence : 0.8920
## Balanced Accuracy : 0.6560
##
## 'Positive' Class : 0
##
This model is comparable to the previous one. It yielded 124 false negatives and 85 true positives. The accuracy is 88.4%, and the positive predictive value is 97%. The detection rate is also higher at 86.68%.
A brief summary of your findings from the previous sections without repeating your statements from earlier as well as a discussion of what you have learned about the data and your research question(s). You should also discuss any shortcomings of your current study, either due to data collection or methodology, and include ideas for possible future research.
Findings:
Curiously enough, bmi was not included in any classification model after feature selection, even though, from a practical perspective, we would expect it to be somewhat important.
Logistic regression yields very unreliable results for imbalanced datasets such as this one. All the metrics from the confusion matrix are artificial and should not be believed.
The Naive Bayes classifier and the decision tree seem to be more robust to the imbalance in the dataset, and are better predictive models for stroke in this context.
Future Directions: